The business analytics group of a company is asked to investigate causes of malfunctions in the technological process of one of the manufacturing plants that result in a significant increase in the cost of the end product. One of the suspected reasons for the malfunctions is deviation of the temperature during the technological process from optimal levels. The sample in the provided file contains times of malfunctions, in seconds since the start of measurement, and minute-by-minute records of temperature.
Increases in production costs affect a company’s bottom line. Production costs can increase because of factors a company cannot control, such as a spike in prices for a product’s inputs. But in this case, the company suspects the cost increase comes from inefficiencies in its production process. Such inefficiencies increase variable costs and erode a company’s margins if not handled in a timely manner.
It’s our job to understand the company’s production process and investigate what’s causing cost increases.
Let’s load the data into a dataframe and explore what we are working with.
# Load the CSV data
Course.Project.Data<-read.csv("Final_Project.csv")
head(Course.Project.Data, 10)
## Time Temperature
## 1 18.08567 91.59307
## 2 28.74417 91.59307
## 3 34.23941 91.59307
## 4 36.87944 91.59307
## 5 37.84399 91.59307
## 6 41.37885 91.59307
## 7 45.19283 91.59307
## 8 60.94242 97.30860
## 9 66.33539 97.30860
## 10 69.95667 97.30860
summary.stats <- function(d){
  # create a local copy for summary stats
  d.t <- d
  # total number of events and number of temperature changes
  d.t$Changes <- c(1, ifelse(diff(d.t$Temperature) != 0, 1, 0))
  num.temp.changes <- sum(d.t$Changes)
  num.events <- length(d.t$Time)
  unique.temps <- length(unique(d.t$Temperature))
  # number of events relative to changes in temperature
  num.events.vs.temps <- num.events / num.temp.changes
  # time elapsed
  time.elapsed <- tail(d.t, 1)$Time - head(d.t, 1)$Time
  time.elapsed.min <- time.elapsed / 60
  time.elapsed.hr <- time.elapsed / 3600
  # average time between temperature changes
  avg.time.temp.chg <- mean(diff(d.t[which(d.t$Changes == 1), ]$Time))
  # collect and name summary stats
  sum.stats <- c(num.events, num.temp.changes, unique.temps,
                 num.events.vs.temps, time.elapsed.hr, time.elapsed.min,
                 time.elapsed / num.events, avg.time.temp.chg)
  sum.stats.names <- c("number of malfunctions", "number of temp changes",
                       "unique temp changes", "malfunction temp ratio",
                       "time elapsed (hr)", "time elapsed (min)",
                       "avg time elapsed b/w malfunction",
                       "avg elapsed time b/w temp chg")
  # return results as dataframe
  return(data.frame(stat.name = sum.stats.names,
                    stat = round(sum.stats, 3)))
}
summary.stats(Course.Project.Data)
## stat.name stat
## 1 number of malfunctions 3054.000
## 2 number of temp changes 242.000
## 3 unique temp changes 242.000
## 4 malfunction temp ratio 12.620
## 5 time elapsed (hr) 4.162
## 6 time elapsed (min) 249.698
## 7 avg time elapsed b/w malfunction 4.906
## 8 avg elapsed time b/w temp chg 61.971
The dataset captures the time of each malfunction and the temperature when the malfunction occurred. The first 10 records indicate that malfunctions can occur at any time, but temperature appears fixed per minute: any number of malfunctions within the same 60-second window share the same temperature. The results from the “summary stats” function confirm this behavior. Malfunctions occur, on average, once every 5 seconds, while temperature changes once every ~60 seconds.
The summary stats function provides additional color. It shows the number of malfunctions in the dataset and the time range in which they occur. The company suspects that temperature deviations and malfunctions are linked. From the results above, we do not yet know the problem, but it seems plausible that the temperature change time lag may be a factor.
To investigate further, let’s visualize the relationship between time and the cumulative number of malfunctions.
Cum.Count <- cbind(Time=Course.Project.Data$Time,Count=1:length(Course.Project.Data$Time))
Counting.Process<-as.data.frame(Cum.Count)
head(Counting.Process)
## Time Count
## 1 18.08567 1
## 2 28.74417 2
## 3 34.23941 3
## 4 36.87944 4
## 5 37.84399 5
## 6 41.37885 6
plot(Counting.Process$Time,Counting.Process$Count,type="s",xlab="Time (s)",ylab="Cumulative count")
What does it tell you about the character of malfunctions and the reasons causing them?
The line plot shows a steady, constant growth in the number of malfunctions over time. In smaller intervals, malfunctions tend to wander along the hypothetical straight line that represents the trend’s slope. Some periods see spikes in the number of malfunctions, while others extend with no malfunctions at all. But overall, malfunctions occur approximately once every 5 seconds and grow linearly with time.
This behavior suggests the malfunctions follow a Poisson process. A Poisson process is a counting process: a set of random variables capturing the evolution of event counts over time. If our process is Poisson, malfunctions occur independently, and one malfunction does not increase or decrease the likelihood of another. Additionally, the mean intensity of malfunctions is constant. It may vary in small intervals but always reverts back to its mean rate.
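To make the idea concrete, here is a small simulation sketch (illustrative only, not the project data, with an assumed rate of 0.2 events per second): a homogeneous Poisson process can be built by accumulating independent exponential inter-arrival times, and its count grows linearly at that rate.

```r
# Illustrative sketch: simulate a homogeneous Poisson process with an assumed
# rate of 0.2 events per second by accumulating exponential inter-arrival times
set.seed(42)
lambda <- 0.2
arrivals <- cumsum(rexp(3000, rate = lambda))
# For a Poisson process, N_t / t converges to lambda as t grows
empirical.rate <- length(arrivals) / max(arrivals)
round(empirical.rate, 3)
```

With 3000 simulated events, the empirical rate lands very close to the assumed 0.2, mirroring the linear growth we see in the plot of our data.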
Of particular interest are the time between malfunctions and the malfunction count per unit of time. If we find statistical distributions for each, we understand the process generating malfunctions. We can predict when malfunctions will arrive and how often they will occur in fixed intervals.
The plot above simply visualizes the growth in malfunctions over time, so we cannot draw any conclusions about the nature of malfunctions yet. We must fit distributions to our process, analyze their diagnostics, and compare the likelihood that each produced the data we observe.
Cumulative intensity is calculated as \(\Lambda(t)=\frac{N_t}{t}\), where \(N_t\) is the number of events during the time interval \([0,t]\). For our data, \(t\) is the sequence of time stamps and \(N_t\) is the count up to time \(t\).
In this case, cumulative intensity demonstrates how frequently the malfunctions occur. A constant rate suggests a linear relationship and independence between malfunctions. An increasing rate indicates a time-dependent relationship, and one in which malfunctions may increase the chance of future malfunctions.
# cumulative intensity calculation
Cum.Intensity <- Counting.Process$Count/Counting.Process$Time
# plots using the Cum.Intensity Variable
plot(Counting.Process$Time,Cum.Intensity,type="l",ylab="Cumulative Intensity")
abline(h=Cum.Intensity[length(Cum.Intensity)], col = "blue")
abline(h=mean(Cum.Intensity), col = "red")
# FOR REFERENCE - example of the FIRST intensity
abline(h=Cum.Intensity[1], col = "grey")
The plot above displays the cumulative intensity as a function of time. This line appears in black. I’ve added three additional lines:
The horizontal grey line shows the cumulative intensity set by the first malfunction. The first malfunction arrived roughly 18 seconds into the process, which puts the initial cumulative intensity at about 0.055 events per second. As expected, cumulative intensity varies wildly at the beginning of the process: the smaller the sample, the greater the effect each malfunction has on the cumulative intensity. Eventually, the cumulative intensity converges to a stable rate, between the last cumulative intensity value and the mean of the cumulative intensity over the entire process. The blue and red lines represent these values, respectively. The actual values are shown below:
c(First.Intensity=Cum.Intensity[1],
Last.Intensity=Cum.Intensity[length(Cum.Intensity)],
Mean.Intensity=mean(Cum.Intensity))
## First.Intensity Last.Intensity Mean.Intensity
## 0.05529239 0.20360077 0.20953048
It’s important to note that convergence suggests stability but may not be ideal depending on the rate. Unless the rate at which malfunctions occur converges below a permissible rate, the company’s costs remain above desired levels. We do not have information about an acceptable rate, but ~0.2 (the convergence rate) seems pretty high. This rate means a malfunction occurs once every 5 seconds.
For now, we’ll return to the process itself and reason about its distribution.
Let’s look at the first 30 rows. We use 30 rows to provide context for the highlighted note in the assignment, which appears below this table.
head(Course.Project.Data,30)
## Time Temperature
## 1 18.08567 91.59307
## 2 28.74417 91.59307
## 3 34.23941 91.59307
## 4 36.87944 91.59307
## 5 37.84399 91.59307
## 6 41.37885 91.59307
## 7 45.19283 91.59307
## 8 60.94242 97.30860
## 9 66.33539 97.30860
## 10 69.95667 97.30860
## 11 76.17420 97.30860
## 12 80.48524 97.30860
## 13 81.29133 97.30860
## 14 86.18149 97.30860
## 15 91.28642 97.30860
## 16 91.75162 97.30860
## 17 98.29452 97.30860
## 18 142.58741 95.98865
## 19 149.82484 95.98865
## 20 151.58587 95.98865
## 21 156.37781 95.98865
## 22 161.97298 95.98865
## 23 172.42610 95.98865
## 24 174.79452 95.98865
## 25 184.07435 100.38440
## 26 209.82744 100.38440
## 27 223.35757 100.38440
## 28 230.02632 100.38440
## 29 252.39556 99.98330
## 30 309.39009 102.54126
Note that the first 7 rows (events) occurred during the first minute. The temperature measurement for the first minute was 91.59307°F. The following 10 rows happen during the second minute and the second minute temperature is 97.3086°F. The third minute had 7 events at temperature 95.98865°F. The fourth minute had 4 events at 100.3844°F. And the following fifth minute had only 1 event at 99.9833°F.
Because temperature changes once every minute on average, we calculate one-minute malfunction counts as well. First, we must check whether there are malfunctions every minute. Minutes without a malfunction cannot be ignored - we need to record those minutes with a count of 0 and a temperature of NA.
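As a quick illustration of the zero-count requirement (using a hypothetical toy vector of event times, not the project data), `tabulate()` keeps skipped minutes as zeros automatically:

```r
# Toy example: event times in seconds; minutes 2, 4, 5, and 6 have no events
times <- c(5, 20, 130, 140, 150, 400)
minute.index <- floor(times / 60) + 1        # which minute each event falls in
counts <- tabulate(minute.index, nbins = max(minute.index))
counts                                       # zeros are kept for empty minutes
```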
get.missing <- function(d){
  # create a local copy
  d.t <- d
  # find the minute representation of each event, and the diff in minutes
  d.t$Minute.Rep <- floor(d.t$Time/60) + 1
  d.t$Minute.Diff <- c(0, diff(d.t$Minute.Rep))
  # get the rows where a minute is skipped (diff = 2)
  mm.ind <- which(d.t$Minute.Diff == 2)
  mm.v <- c(mm.ind - 1, mm.ind)
  mm.e <- d.t[mm.v, ]
  # order those examples for display
  mm.e <- mm.e[order(mm.e$Minute.Rep), ]
  missing.minutes <- d.t[mm.ind, "Minute.Rep"] - 1
  # return missing list and df example
  return(list(
    missing.minutes.example = mm.e,
    missing.minutes = missing.minutes
  ))
}
(missing.min <- get.missing(Course.Project.Data))
## $missing.minutes.example
## Time Temperature Minute.Rep Minute.Diff
## 1420 6354.872 98.45212 106 0
## 1421 6421.328 101.68017 108 2
## 2168 10079.018 101.60903 168 0
## 2169 10141.876 99.74646 170 2
## 2314 11085.663 98.49842 185 0
## 2315 11162.077 92.69665 187 2
## 2330 11279.821 97.43224 188 0
## 2331 11356.770 91.87740 190 2
## 2382 11690.702 96.65583 195 0
## 2383 11761.434 106.68515 197 2
## 2486 12057.477 101.18516 201 0
## 2487 12121.915 106.81680 203 2
## 2911 13917.449 94.70082 232 0
## 2912 14002.761 99.88177 234 2
## 3045 14875.000 96.45936 248 0
## 3046 14953.108 103.06171 250 2
##
## $missing.minutes
## [1] 107 169 186 189 196 202 233 249
The “get.missing” function returns the minutes for which there are no malfunctions. Additionally, the function prints a dataframe to the console to demonstrate what a “missing” minute looks like. The first minute with no malfunctions arrives at minute 107, about 1 hour and 47 minutes into the process. The next minute with no malfunctions appears roughly an hour later. We must inject rows into our “one-minute count” dataframe, as a minute with no malfunctions should not be discarded; instead, its count should be zero.
one.minute.count <- function(d,m){
  # create grouped object with appropriate columns (requires dplyr)
  minute.mid <- function(i) (i * 60) - 30
  df.group <- d %>%
    mutate(
      Minute.Rep = floor(Time/60) + 1,
      Minute.times = minute.mid(Minute.Rep)
    ) %>%
    dplyr::group_by(Minute.times) %>%
    dplyr::summarise(Minute.counts = n(),
                     Minute.Temps = first(Temperature))
  # add in the missing rows with a count of 0 and temperature NA
  missing.df <- data.frame(
    Minute.times = minute.mid(m$missing.minutes),
    Minute.counts = rep(0, length(m$missing.minutes)),
    Minute.Temps = rep(NA, length(m$missing.minutes))
  )
  # return the result as a dataframe, sorted by time
  res.df <- as.data.frame(rbind(df.group, missing.df)) %>% arrange(Minute.times)
  return(res.df)
}
One.Minute.Counts.Temps <- one.minute.count(Course.Project.Data, missing.min)
head(One.Minute.Counts.Temps,10)
## Minute.times Minute.counts Minute.Temps
## 1 30 7 91.59307
## 2 90 10 97.30860
## 3 150 7 95.98865
## 4 210 4 100.38440
## 5 270 1 99.98330
## 6 330 6 102.54126
## 7 390 23 104.29037
## 8 450 3 94.77298
## 9 510 24 104.86556
## 10 570 20 103.43659
The “one.minute.count” function returns a dataframe with the number of malfunctions per minute and the temperature recorded during that minute. The “Minute.times” column represents the midpoint, in seconds, of each minute.
The plot below charts the number of counts in one minute (y-axis) against the midpoint of each minute.
plot(One.Minute.Counts.Temps$Minute.times,One.Minute.Counts.Temps$Minute.counts,
     xlab="Minute midpoint (s)",ylab="Malfunctions per minute")
In this section, we test our dataset for overdispersion. Overdispersion arises when the observed data is more dispersed than a given model permits. This problem generally arises because the model assumes a distribution for which a functional relationship between the mean and the variance exists, and that relationship limits the model’s ability to capture the true variance in the response.
As an example, a Poisson distribution’s mean is equal to its variance. Overdispersion would result if a model assumes a Poisson-distributed response but the observed data’s variability is larger than its mean. Overdispersion in models generates misleading results and underestimated standard errors, which lead to unsafe conclusions. Therefore, it’s important that we detect overdispersion from the onset and avoid distributions that lead to overdispersion in our models.
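A quick simulated comparison (illustrative values, not the project data) shows the mean-variance relationship at work: Poisson samples have variance close to the mean, while negative binomial samples with the same mean are overdispersed.

```r
# Poisson: variance tracks the mean; negative binomial: variance = mu + mu^2/size
set.seed(1)
pois.sample <- rpois(10000, lambda = 5)
nb.sample <- rnbinom(10000, size = 2, mu = 5)
c(pois.mean = mean(pois.sample), pois.var = var(pois.sample))
c(nb.mean = mean(nb.sample), nb.var = var(nb.sample))   # nb.var near 17.5
```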
Diagnosing overdispersion is not straightforward. Multiple methods exist, so we will explore some to detect overdispersion in our data.
Look at the output of glm() and compare the residual deviance with the number of degrees of freedom. If the assumed model is correct, deviance is asymptotically distributed as chi-squared \(\chi^2\) with \(n-k\) degrees of freedom, where \(n\) is the number of observations and \(k\) is the number of parameters. For the chi-squared distribution the mean equals the number of degrees of freedom \(n-k\). If the residual deviance returned by glm() is greater than \(n-k\), it might be a sign of over-dispersion.
Test the method on simulated Poisson data.
Test.Deviance.Overdispersion.Poisson <- function(Sample.Size, Parameter.Lambda){
  my.Sample <- rpois(Sample.Size, Parameter.Lambda)
  Model <- glm(my.Sample ~ 1, family = poisson)
  Dev <- Model$deviance
  Deg.Fred <- Model$df.residual
  z <- (Dev/Deg.Fred - 1) / sqrt(2/Deg.Fred)
  ((z > -1.96) & (z <= 1.96)) * 1
}
Test.Deviance.Overdispersion.Poisson(100,1)
## [1] 1
The function simulates a sample from the Poisson distribution, estimates parameter \(\lambda\), which is simultaneously the mean value and the variance, then checks whether the normalized statistic \(\frac{Dev/Deg.Freedom - 1}{\sqrt{2/Deg.Freedom}}\) belongs to the interval \((-1.96,1.96]\). If yes, the result is 1. Otherwise it is 0.
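The check itself is simple arithmetic. A small standalone sketch of the same statistic, applied to two illustrative deviance values:

```r
# Normalize Dev/df - 1 by its approximate standard deviation sqrt(2/df)
# and compare with the standard normal bounds used above
deviance.check <- function(Dev, Deg.Fred) {
  z <- (Dev / Deg.Fred - 1) / sqrt(2 / Deg.Fred)
  as.integer(z > -1.96 & z <= 1.96)   # 1 = consistent with Poisson
}
deviance.check(Dev = 110, Deg.Fred = 99)    # mild excess deviance: passes (1)
deviance.check(Dev = 1799, Deg.Fred = 249)  # large excess deviance: fails (0)
```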
Now repeat the call of the function 300 times to see how many times it returns one and how many times zero.
The estimate of the parameter \(\lambda\) given by glm() is \(e^{Coefficient}\):
set.seed(19)
# replicate and get value
sum(replicate(300,Test.Deviance.Overdispersion.Poisson(100,1)))
## [1] 265
# get the coefficient
exp(glm(rpois(1000,1)~1,family=poisson)$coeff)
## (Intercept)
## 1.005
The quick and dirty method says 265 out of 300 Poisson samples do not show overdispersion. In the remaining cases the method flags overdispersion where none exists, so we have some false positives. The pass rate is still very high, and the errors simply make us more cautious. In this setting, detecting overdispersion when it does not exist is better than failing to detect overdispersion when it is truly present.
The estimate of the parameter \(\lambda\) is 1.005. Because this is a null model, the estimate measures the mean intensity. We created 1000 random variables with true \(\lambda = 1\), so it makes sense that our estimate is close to \(\lambda\). We have to exponentiate the coefficients because the model returns them on a log scale: Poisson regression uses a log link, which makes a multiplicative relationship linear.
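The log-link relationship is easy to verify directly (a small sketch with simulated data, not the project data): in a null Poisson regression, exponentiating the intercept recovers the sample mean.

```r
# exp(intercept) of a null Poisson glm equals the sample mean of the response
set.seed(7)
y <- rpois(1000, lambda = 3)
fit <- glm(y ~ 1, family = poisson)
exp(coef(fit))           # same value as mean(y), up to numerical tolerance
```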
Perform the same test on negative binomial data.
Test.Deviance.Overdispersion.NBinom <- function(Sample.Size, Parameter.prob){
  my.Sample <- rnbinom(Sample.Size, 2, Parameter.prob)
  Model <- glm(my.Sample ~ 1, family = poisson)
  Dev <- Model$deviance
  Deg.Fred <- Model$df.residual
  z <- (Dev/Deg.Fred - 1) / sqrt(2/Deg.Fred)
  ((z > -1.96) & (z <= 1.96)) * 1
}
sum(replicate(300,Test.Deviance.Overdispersion.NBinom(100,.2)))
## [1] 0
We see that the over-dispersed negative binomial sample never passes the test. Therefore, we have no false negatives - no cases where overdispersion is present but we fail to detect it. This performance is desirable.
Now apply the test to the one-minute event counts.
GLM.model<-glm(One.Minute.Counts.Temps$Minute.counts~1,family=poisson)
summary(GLM.model)
##
## Call:
## glm(formula = One.Minute.Counts.Temps$Minute.counts ~ 1, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.9429 -2.3450 -0.8103 1.4815 7.1961
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.5027 0.0181 138.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1799 on 249 degrees of freedom
## Residual deviance: 1799 on 249 degrees of freedom
## AIC: 2789.3
##
## Number of Fisher Scoring iterations: 5
Do you see signs of over-dispersion?
From the regression output above, we see that residual deviance is 1799 and the degrees of freedom is 249. In this case, the model has much higher residual deviance than the degrees of freedom. This discrepancy strongly suggests our model has overdispersion. We can modify the test deviance function written above and apply it to our model:
Test.Deviance.Overdispersion.Counts <- function(Model){
  Dev <- Model$deviance
  Deg.Fred <- Model$df.residual
  z <- (Dev/Deg.Fred - 1) / sqrt(2/Deg.Fred)
  ((z > -1.96) & (z <= 1.96)) * 1
}
Test.Deviance.Overdispersion.Counts(GLM.model)
## [1] 0
The model fails the “quick and dirty” test, giving a first strong indication of overdispersion in the dataset. We continue to explore overdispersion with two more robust tests.
The test implemented in AER is described in Cameron, A.C. and Trivedi, P.K. (1990). Regression-based Tests for Overdispersion in the Poisson Model. Journal of Econometrics, 46, 347–364.
In a Poisson model, the mean is \(E(Y)=\lambda\) and the variance is \(V(Y)=\lambda\) as well; they are equal. The test has a null hypothesis \(c=0\), where \(Var(Y)=\lambda+c \cdot f(\lambda)\); \(c<0\) means under-dispersion and \(c>0\) means over-dispersion. The function \(f(\cdot)\) is some monotonic function (linear by default, or quadratic). The test statistic is a t statistic which is asymptotically standard normal under the null.
Learn how to use dispersiontest() from AER and apply it to GLM.model that we fit earlier:
Disp.Test <- dispersiontest(object = GLM.model, alternative = 'two.sided')
Disp.Test
##
## Dispersion test
##
## data: GLM.model
## z = 8.5747, p-value < 2.2e-16
## alternative hypothesis: true dispersion is not equal to 1
## sample estimates:
## dispersion
## 7.377975
Does the test show overdispersion?
The dispersiontest function tests the null hypothesis of equidispersion in Poisson GLMs against the alternative of over- or underdispersion. In this case, the p-value is extremely small, so we reject the null hypothesis of equidispersion. Because the estimated dispersion parameter is ~7.38, well above 1, we conclude the deviation is overdispersion. This result is in line with what we found in the first “quick and dirty” method and provides more evidence for overdispersion.
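The idea behind the Cameron-Trivedi test can also be sanity-checked with a simplified sketch (written without the AER package, with a constant fitted mean, on simulated data rather than the project data): under equidispersion the auxiliary quantity \(((y-\mu)^2 - y)/\mu\) has mean zero, so its t statistic stays small for genuinely Poisson data.

```r
# Simplified version of the auxiliary statistic: under equidispersion,
# ((y - mu)^2 - y) / mu has mean zero, so its t statistic is roughly N(0, 1)
set.seed(11)
y <- rpois(250, lambda = 12)
mu <- rep(mean(y), length(y))               # fitted means of a null model
aux <- ((y - mu)^2 - y) / mu
t.stat <- sqrt(length(y)) * mean(aux) / sd(aux)
round(t.stat, 2)                            # small in magnitude for Poisson data
```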
We’ll conclude with one more test.
# glm.nb comes from the MASS package
GLM.model.nb <- glm.nb(One.Minute.Counts.Temps$Minute.counts ~ 1)
summary(GLM.model.nb)
##
## Call:
## glm.nb(formula = One.Minute.Counts.Temps$Minute.counts ~ 1, init.theta = 1.747516571,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6951 -0.9389 -0.2977 0.4958 2.0931
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.50275 0.05115 48.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1.7475) family taken to be 1)
##
## Null deviance: 278.5 on 249 degrees of freedom
## Residual deviance: 278.5 on 249 degrees of freedom
## AIC: 1746.7
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1.748
## Std. Err.: 0.179
##
## 2 x log-likelihood: -1742.697
odTest(GLM.model.nb)
## Likelihood ratio test of H0: Poisson, as restricted NB model:
## n.b., the distribution of the test-statistic under H0 is non-standard
## e.g., see help(odTest) for details/references
##
## Critical value of test statistic at the alpha= 0.05 level: 2.7055
## Chi-Square Test Statistic = 1044.585 p-value = < 2.2e-16
Does the test show overdispersion?
The odTest function from the pscl library compares the log-likelihood of a negative binomial model with that of a Poisson regression, the restricted case. The null hypothesis of odTest is that the distribution is Poisson, as a particular case of the negative binomial, against the negative binomial alternative. The null corresponds to the dispersion parameter \(\theta = 1/a\) going to infinity as \(a\) approaches \(0\). The critical value of the test statistic at the \(0.05\) level is \(2.7055\).
In this case, we reject the null hypothesis because our test statistic is ~1044, much greater than \(2.7055\). The p-value is extremely small, and we conclude that the distribution is NOT Poisson as a particular case of the negative binomial.
This third method is another way to test for overdispersion in a dataset. It focuses on whether or not we have a Poisson distribution as a special case of the negative binomial, so this method differs from the past two we tried yet yields the same result. Therefore, this test offers even more evidence for overdispersion in the dataset.
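The likelihood-ratio idea behind odTest can be sketched directly (assuming the MASS package for glm.nb, and deliberately overdispersed simulated data rather than the project data): fit Poisson and negative binomial null models and compare twice the difference in their log-likelihoods with the critical value odTest uses.

```r
library(MASS)
# Poisson is the restricted model, negative binomial the unrestricted one;
# a large likelihood ratio favors the negative binomial
set.seed(3)
y <- rnbinom(250, size = 2, mu = 12)        # deliberately overdispersed counts
pois.ll <- logLik(glm(y ~ 1, family = poisson))
nb.ll <- logLik(glm.nb(y ~ 1))
LR <- 2 * as.numeric(nb.ll - pois.ll)
LR > 2.7055                                  # TRUE: the NB model fits far better
```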
In the last section, we found clear evidence of overdispersion. Overdispersion is a sign that a process does not follow a Poisson distribution because the variance is greater than the mean of the data. In this section, we’ll explore which distributions can handle overdispersion, and we’ll find which distribution best fits the count and intensity of malfunctions.
The Kolmogorov-Smirnov (KS) test is used to test hypotheses of equivalence between two empirical distributions, or between one empirical distribution and one theoretical distribution.
We’ll start with an example from the normal distribution to demonstrate how the KS test works.
sample1=rnorm(100)
sample2=rnorm(100,1,2)
Cum.Distr.Functions <- data.frame(sample1,sample2)
ecdfplot(~ sample1 + sample2, data=Cum.Distr.Functions, auto.key=list(space='right'))
We create two samples of random variables from a normal distribution but with different means and standard deviation parameters. The plot above demonstrates how the different parameters affect the ECDF of each sample.
Check equivalence of empirical distributions for the two samples.
ks.test(sample1,sample2)
##
## Two-sample Kolmogorov-Smirnov test
##
## data: sample1 and sample2
## D = 0.28, p-value = 0.0007873
## alternative hypothesis: two-sided
What does this output tell you about equivalence of the two distributions?
The two-sample KS test checks whether two samples are sampled from the same underlying distribution. The null hypothesis is that the samples come from the same distribution, while the alternative hypothesis says they are sampled from different distributions.
In this case, we reject the null hypothesis that the samples come from the same distribution. At first, this rejection seems odd given that we generate each sample with “rnorm”. But the two-sample KS test does not make any assumptions about what the distribution is; rather, it checks whether the samples come from the same underlying distribution. Here, KS rejects the null because the samples are both normal, but not from the same normal distribution.
Check equivalence of empirical distribution of sample1 and theoretical distribution Norm(0,1).
ks.test(sample1,"pnorm",mean=0,sd=1)
##
## One-sample Kolmogorov-Smirnov test
##
## data: sample1
## D = 0.072554, p-value = 0.6684
## alternative hypothesis: two-sided
What does this output tell you?
The one-sample KS test works a bit differently. It tests whether a sample follows a theoretical distribution. Here, we test sample1 against the theoretical standard normal distribution. As expected, the KS test fails to reject the null hypothesis: we cannot say that our sample does not come from the standard normal. This makes sense, given that we created the sample with the same parameters as the standard normal.
Check equivalence of the empirical distribution of sample2 and theoretical distribution Norm(0,1).
ks.test(sample2,"pnorm",mean=0,sd=1)
##
## One-sample Kolmogorov-Smirnov test
##
## data: sample2
## D = 0.2572, p-value = 3.589e-06
## alternative hypothesis: two-sided
What does this output tell you?
Sample2 comes from a normal with a mean = 1 and sd = 2. While still normal, sample2 is not from the standard normal, which is what we are testing it against in this one-sample test. Therefore, we reject the null hypothesis and conclude sample2 is not from the standard normal.
The findings from the one-sample tests are consistent with the test comparing sample1 to sample2. Together, these results demonstrate the KS test’s flexibility. It can test two empirical distributions against each other and also test one distribution against a theoretical one.
Apply Kolmogorov-Smirnov test to Counting.Process$Time and theoretical exponential distribution with parameter equal to average intensity. The empirical distribution should be estimated for time intervals between malfunctions.
We can apply the KS test to our dataset as well. In our scenario, we want to model the intensity of malfunctions. This information is contained within the distribution of the time between malfunctions. We’ll start with the exponential distribution, which models the time between events in a Poisson process:
# function that takes dataset and distribution to test
ks.counting.time.test <- function(d, ecdf.f){
intervals <- diff(d$Time)
mean.intensity <- mean(d$Count/d$Time)
return(ks.test(intervals, ecdf.f, mean.intensity))
}
# results from the test
(KS.Test.Event.Intervals <- ks.counting.time.test(Counting.Process, "pexp"))
##
## One-sample Kolmogorov-Smirnov test
##
## data: intervals
## D = 0.095061, p-value < 2.2e-16
## alternative hypothesis: two-sided
# Here's what the test results look like
c(KS.Test.Event.Intervals$statistic,p.value=KS.Test.Event.Intervals$p.value)
## D p.value
## 0.0950615 0.0000000
From the test results, we reject the null hypothesis and conclude that the time between malfunctions is not exponentially distributed with rate given by our mean-intensity estimate. The results are expected given what we already know about the overdispersion in the dataset and the relationship between the exponential and Poisson distributions. If counts come from a Poisson process, the time between events follows the exponential distribution. In Part 4, we cast substantial doubt on the Poisson distribution as the generator of observed malfunction counts per unit of time. As a result, it makes sense that the time between malfunctions is not exponentially distributed either.
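For contrast, a quick simulated check (illustrative, not the project data): inter-arrival times drawn from a genuine homogeneous Poisson process do pass the exponential KS test.

```r
# Exponential gaps at an assumed rate of 0.2: the KS statistic stays small
set.seed(9)
gaps <- rexp(3000, rate = 0.2)
ks.result <- ks.test(gaps, "pexp", rate = 0.2)
ks.result$statistic        # far smaller than the D = 0.095 we observed
```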
Plot empirical cumulative distribution function for time intervals between malfunctions.
plot(ecdf(diff(Counting.Process$Time)),main="ECDF", xlab="Time Interval", ylab="CDF")
Use 5 different candidates for distribution of Poisson intensity of malfunctions. Find one-minute intensities Event.Intensities. One-minute intensity by definition is the number of events per unit of time (second).
We doubt that the Poisson distribution generates the counts per unit of time, and we seek alternatives to the exponential distribution to explain the time between malfunctions. We’ll try 5 candidate distributions to pinpoint which describes the observed data best and why.
First, let’s look at the histogram of malfunction intensities, where the unit of time is a second.
# histogram wrapper to make its data output invisible
ihist <- function(dist,...) invisible(hist(dist,...))
# calculate event intensities and their mean.
Event.Intensities <- One.Minute.Counts.Temps$Minute.counts/60
(mean.intensity <- mean(Event.Intensities))
## [1] 0.2036
(Event.Hist <- ihist(Event.Intensities))
## $breaks
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
##
## $counts
## [1] 82 68 45 26 15 7 6 1
##
## $density
## [1] 3.28 2.72 1.80 1.04 0.60 0.28 0.24 0.04
##
## $mids
## [1] 0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75
##
## $xname
## [1] "dist"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
The mean malfunction intensity \(\lambda = 0.2036\), or about 1 malfunction every 5 seconds.
What distribution does this histogram remind you of?
The histogram above plots the frequency of each one-minute intensity rate per unit of time (seconds). The histogram is reminiscent of the exponential distribution, but its decay is not really “exponential”. There appears to be substantial positive skew, with quite a few high-intensity periods. As a result, the right tail is far more dense than what we’d expect from a truly exponential distribution. This type of slowly decaying process is more characteristic of a gamma distribution with shape parameter between 1 and 2. When the gamma distribution’s shape parameter = 1, the distribution is the same as the exponential. When it equals 2, the first histogram bin is small, followed by a sharp increase and a slow decay as values increase along the x-axis. This histogram seems to fit somewhere in between.
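The shape-parameter claim can be checked directly (an illustrative sketch with an assumed rate of 5): at shape = 1 the gamma density is exactly exponential, while at shape = 2 the density starts at zero, rises to a mode at (shape - 1)/rate, and then decays slowly.

```r
x <- seq(0, 1, by = 0.01)
shape1 <- dgamma(x, shape = 1, rate = 5)   # identical to dexp(x, rate = 5)
shape2 <- dgamma(x, shape = 2, rate = 5)
all.equal(shape1, dexp(x, rate = 5))       # TRUE: shape 1 reduces to exponential
x[which.max(shape2)]                       # mode at (2 - 1)/5 = 0.2
```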
The histogram’s departure from the exponential distribution is easier to visualize with an example. The first two graphs below are histograms of the counts per minute from our observed dataset and of draws from an exponential distribution with rate equal to our sample’s mean intensity. The last histogram overlays the previous two on the same plot. In red are the observed counts per minute. In purple is what we would expect under an exponential distribution with rate \(\lambda = 0.2036\).
set.seed(5)
# Get the same number of random exponential variables with intensity = rate
Theoretical.Exp <- rexp(length(Event.Intensities), mean.intensity)
# create a comparable histogram
Theoretical.Hist <- ihist(Theoretical.Exp, breaks = 4)
One.Min.Hist <- ihist(One.Minute.Counts.Temps$Minute.counts)
# plot both histograms together
plot(Theoretical.Hist, col=rgb(0,0,1,1/4), xlim = c(0,50))
plot(One.Min.Hist, col=rgb(1,0,0,1/4), xlim = c(0,50), add = T)
The final histogram helps visualize how the skew and shape of the counts do not come from a Poisson distribution. The difference in shape applies to the exponential vs. a gamma with shape > 1 as well. The mean of the Poisson distribution for counts and the mean of the exponential distribution for times between counts (which are inversely related) may be accurate representations of the dataset, but these distributions do not capture the way the data varies. The histograms demonstrate that intense periods (or shorter windows between events) occur more often than expected.
This visualization is in line with the results we’ve collected from tests in previous sections, but we cannot draw conclusions about the correct distributions from plots alone. In the upcoming sections, we’ll perform quantitative tests to determine the proper distribution and subsequently assess whether or not the gamma hypothesis holds.
Suggest 5 candidates for the distribution. Fit each of your 5 candidate distributions to Event.Intensities using fitdistr() from MASS. Start with fitting normal and exponential distributions first.
We are already aware that the normal distribution and exponential distribution are not the correct fits. That being said, we’ll include their fit (or lack thereof) as a reference point. Using these distributions gives us a chance to introduce the “fitdistr” method, which we will use to find the optimal distribution for our observed dataset.
# fitdistr() comes from package MASS; plotdist() from package fitdistrplus
library(MASS)
library(fitdistrplus)
# Normal fit
(Fitting.Normal <- fitdistr(Event.Intensities, "normal"))
## mean sd
## 0.203600000 0.158227459
## (0.010007183) (0.007076147)
plotdist(Event.Intensities,"norm",para=list(mean=Fitting.Normal$estimate[1],sd=Fitting.Normal$estimate[2]))
# Exponential fit
(Fitting.Exponential <- fitdistr(Event.Intensities, "exponential"))
## rate
## 4.9115914
## (0.3106363)
plotdist(Event.Intensities,"exp",para=list(rate=Fitting.Exponential$estimate[1]))
# Normal KS
KS.Normal <- ks.test(Event.Intensities, "pnorm",
Fitting.Normal$estimate[1], Fitting.Normal$estimate[2])
c(KS.Normal$statistic,P.Value=KS.Normal$p.value)
## D P.Value
## 0.1326020316 0.0003039941
# Exponential KS
KS.Exp <- ks.test(Event.Intensities, "pexp",
Fitting.Exponential$estimate[1], Fitting.Exponential$estimate[2])
c(KS.Exp$statistic,P.Value=KS.Exp$p.value)
## D P.Value
## 0.115233049 0.002615812
What do you conclude from these tests?
The fitdistr method provides parameter estimates given a dataset and a theoretical distribution, and the plotdist method visualizes what fitdistr has essentially done. From the plotdist histogram and Q-Q plot, we see right away that the normal distribution is a poor fit. The exponential distribution improves upon it, but the effect of the positive skew is revealed. The KS tests provide quantitative support: in both cases, the test rejects the null hypothesis that the Event.Intensities sample is normally or exponentially distributed. From this combination of qualitative and quantitative evidence, we can safely conclude we need to fit additional distributions.
Try to fit gamma distribution directly using fitdistr():
# GENERATES AN ERROR, SO RMARKDOWN WON'T KNIT TO HTML
# (Fitting.Gamma <- fitdistr(Event.Intensities, "gamma"))
What results do you obtain when you fit gamma distribution using fitdistr()?
The gamma distribution is defined for positive, non-zero values only, so fitdistr throws an error when we pass a sample that includes zeros. The one-minute malfunction count data contains 7 minutes with a count of zero, so the intensity per time unit for those minutes is also zero; passing this sample to fitdistr with the gamma distribution specified therefore fails.
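A minimal reproduction of that failure on a toy sample (not the project data), with the error caught via tryCatch so the document can still knit:

```r
library(MASS)

set.seed(1)
toy <- c(0, rgamma(50, shape = 2, rate = 3))  # one zero poisons the likelihood

# The gamma log-density is non-finite at 0, so the optimizer inside
# fitdistr() aborts with an error when a zero is present
fit.attempt <- tryCatch(fitdistr(toy, "gamma"), error = function(e) e)
inherits(fit.attempt, "error")

# Dropping the zero makes the same call succeed
fitdistr(toy[toy > 0], "gamma")
```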
(Fitting.Gamma.No.Zero <- fitdistr(Event.Intensities[-which(Event.Intensities==0)], "gamma"))
## shape rate
## 1.7258099 8.2052264
## (0.1442838) (0.7948451)
The “Fitting.Gamma.No.Zero” result demonstrates that fitdistr works once the zero intensities are removed. While we do not use these estimates, the code above shows how to make fitdistr work with the gamma distribution.
The gamma distribution takes two parameters: shape and rate.
Estimate parameters of gamma distribution using method of moments.
Gamma relies on both parameters to calculate mean and variance. Assuming random variable \(X \sim Gamma(\alpha, \beta)\) with shape \(\alpha\) and scale \(\beta\): \(E[X] = \alpha\beta\) and \(Var(X) = \alpha\beta^2\).
We also know that the rate \(\lambda = \frac{1}{\beta}\), so we can express gamma’s rate and shape in terms of its moments: \(\lambda = \frac{E[X]}{Var(X)}\) and \(\alpha = \frac{E[X]^2}{Var(X)}\) (the intermediate algebra is straightforward and excluded for brevity).
var.intensity <- var(Event.Intensities)*(length(Event.Intensities)-1)/length(Event.Intensities)
(Moments.Rate <- mean.intensity/var.intensity)
## [1] 8.132313
(Moments.Shape <- mean.intensity^2/var.intensity)
## [1] 1.655739
The rate and shape parameters are estimated from the sample mean and variance under a gamma assumption, and they look promising even before formal testing. From the histogram, we postulated a gamma distribution with \(1 < \alpha < 2\); the method-of-moments estimate is \(\alpha = 1.655739\).
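As a sanity check on these formulas (separate from the project data), we can draw from a gamma with known parameters and confirm that the same moment estimators recover them; the seed and true values below are arbitrary:

```r
set.seed(42)
true.shape <- 1.7
true.rate  <- 8
x <- rgamma(1e5, shape = true.shape, rate = true.rate)

m <- mean(x)
v <- var(x) * (length(x) - 1) / length(x)  # population variance, as above

(mom.rate  <- m / v)     # rate  = mean / variance, close to 8
(mom.shape <- m^2 / v)   # shape = mean^2 / variance, close to 1.7
```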
Next, we’ll assess how well the gamma distribution with our estimates fits our malfunction dataset.
Check gamma with these parameters as a theoretical distribution using Kolmogorov-Smirnov test.
We have the expected rate for our sampled gamma distribution, and the shape is in line with our initial expectations from the histogram plot. We use one-sample KS to test the sampled distribution against the theoretical gamma with rate and shape parameters equal to our estimates.
plotdist(Event.Intensities,"gamma",para=list(shape=Moments.Shape,rate=Moments.Rate))
c(Moments.Shape, Moments.Rate)
## [1] 1.655739 8.132313
(KS.Moments <- ks.test(Event.Intensities, "pgamma", Moments.Shape, Moments.Rate))
##
## One-sample Kolmogorov-Smirnov test
##
## data: Event.Intensities
## D = 0.05781, p-value = 0.3736
## alternative hypothesis: two-sided
The results from the gamma distribution are the best we’ve seen yet. The plotdist output shows how well the gamma fits. In the top-left plot, the theoretical density passes through the midpoint of almost every histogram bin and correctly accounts for the positive skew in the Event.Intensities data. The Q-Q plot is also a large improvement over the exponential. This improvement stems from gamma’s free shape parameter, \(\alpha\). When \(\alpha = 1\), the gamma distribution reduces to the exponential. As \(\alpha\) grows beyond 1, the density develops an interior mode and becomes increasingly bell-shaped. Our fit has \(\alpha = 1.655739\), which appears to fit extremely well.
Lastly, the empirical CDF plot is the best fit we’ve seen yet. The KS test supports this observation more rigorously. The KS test for the one-sample gamma fails to reject the null hypothesis. There is not enough support to conclude that the distribution is not gamma.
Find at least 2 more candidates and test them by Kolmogorov-Smirnov.
The gamma distribution provided great results, but there are similar distributions to gamma that we still need to try.
The Log Normal Distribution
We start with the lognormal distribution, which describes a random variable whose logarithm is normally distributed. It is very similar to the gamma distribution, and the lognormal often models durations as well, especially those related to human behavior. Its main difference from gamma: the log of a lognormal random variable is normally distributed, while the log of a gamma random variable is left-skewed.
The lognormal distribution is continuous but takes positive values only, so fitting it runs into the same zero-value issue we hit with gamma. Because we cannot use fitdistr here, we estimate the meanlog and sdlog parameters ourselves by matching moments. We do so in the code below using the Event.Intensities data:
# LOG NORMAL EXAMPLE
eim <- mean(Event.Intensities)
eiv <- var(Event.Intensities)
LogNormal.Mu <- log((eim^2)/sqrt(eiv+eim^2))
LogNormal.sigma <- sqrt(log(eiv/(eim^2)+1))
plotdist(Event.Intensities, "lnorm",para=list(meanlog=LogNormal.Mu,sdlog=LogNormal.sigma))
(KS.Candidate.4 <- ks.test(Event.Intensities, "plnorm", LogNormal.Mu, LogNormal.sigma))
## Warning in ks.test(Event.Intensities, "plnorm", LogNormal.Mu,
## LogNormal.sigma): ties should not be present for the Kolmogorov-Smirnov
## test
##
## One-sample Kolmogorov-Smirnov test
##
## data: Event.Intensities
## D = 0.13127, p-value = 0.0003623
## alternative hypothesis: two-sided
The lognormal results are subpar. The fit is better than the normal and exponential fits, but it has clear deficiencies compared to gamma. The lognormal has the general shape and skew we want, but its peak density is too high and its right tail drops off too quickly. Additionally, it does not pass the KS test: the D statistic and associated p-value lead us to reject the null hypothesis and conclude that the lognormal is not the right distribution for our data.
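Before moving on, the moment-matching formulas used above can be verified analytically: a lognormal with parameters \(\mu\) and \(\sigma\) has mean \(e^{\mu + \sigma^2/2}\) and variance \((e^{\sigma^2} - 1)e^{2\mu + \sigma^2}\), so plugging the derived \(\mu\) and \(\sigma\) back in should return the original moments. A quick check with arbitrary stand-in values (not our sample moments):

```r
eim <- 0.2036   # stand-in sample mean
eiv <- 0.025    # stand-in sample variance

mu    <- log(eim^2 / sqrt(eiv + eim^2))
sigma <- sqrt(log(eiv / eim^2 + 1))

# The lognormal moment formulas recover the inputs exactly
implied.mean <- exp(mu + sigma^2 / 2)
implied.var  <- (exp(sigma^2) - 1) * exp(2 * mu + sigma^2)
c(implied.mean - eim, implied.var - eiv)  # both essentially zero
```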
The Quasi-Poisson Distribution
We’ve repeatedly explored the Poisson distribution and noted it cannot handle overdispersion, so we know a plain Poisson will not fit better than gamma. But glm() offers a family called “quasipoisson” that relaxes the assumption that mean = variance, so in theory the quasi-Poisson sounds like a reasonable choice.
Unfortunately, the quasi-Poisson model is a quasi-likelihood model rather than a full likelihood model, so we need a workaround to fit its “distribution”. With quasi-Poisson, we still specify the mean and variance but leave the likelihood unspecified; as a result, there is no density function assigning a predicted probability to each count.
We will work around this issue by fitting glm(y ~ 1), which essentially computes the mean of a process plus a dispersion parameter based on the residual sum of squares:
Quasi.Glm <- summary(glm(One.Minute.Counts.Temps$Minute.counts ~ 1, family=quasipoisson))
Quasi.Rate <- Quasi.Glm$dispersion
Quasi.Shape <- Moments.Shape
plotdist(Event.Intensities,"gamma", para=list(shape=Quasi.Shape,rate=Quasi.Rate))
(KS.Candidate.5 <- ks.test(Event.Intensities, "pgamma", Quasi.Shape, Quasi.Rate))
## Warning in ks.test(Event.Intensities, "pgamma", Quasi.Shape, Quasi.Rate):
## ties should not be present for the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: Event.Intensities
## D = 0.084184, p-value = 0.05782
## alternative hypothesis: two-sided
Essentially, this strategy fine-tunes the parameters and fits a different gamma. The rationale is that the quasi-Poisson dispersion parameter may serve as a better rate than the one we derived for gamma originally. First, we fit a null glm with the quasi-Poisson family to extract the dispersion parameter from the counts, and we save that dispersion as the rate. Because we are still testing against the gamma distribution (we know Poisson will not work), we reuse the same shape as before but pass this different rate.
The results are quite interesting. Visually, the fit looks like it does a better job. The KS test also technically fails to reject the null hypothesis, but the p-value is right on the border. Therefore, we are far less confident in this fit for our data even if the distribution looks visually appealing.
We have a gamma that works already, and we are very confident about the result. So there’s no real reason to use a more complicated strategy to back into another version of gamma unless it’s a substantially better fit. Therefore, we’ll stick with the gamma we have.
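As an aside, the null-model trick above is easy to sketch on synthetic counts (arbitrary seed and parameters): for an intercept-only quasi-Poisson glm, the exponentiated intercept is exactly the sample mean, and the dispersion is the Pearson statistic divided by the residual degrees of freedom.

```r
set.seed(7)
# Overdispersed counts: Poisson draws with a gamma-varying rate
y <- rpois(500, lambda = rgamma(500, shape = 2, rate = 0.5))

quasi.fit <- glm(y ~ 1, family = quasipoisson)

exp(coef(quasi.fit))           # the sample mean of y
summary(quasi.fit)$dispersion  # well above 1: overdispersion
```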
Collect all estimated distributions together and make your choice.
In the code below, we collect the results of the 5 distributions we fit. We bind together their one-sample KS test statistic (\(D\)) and associated p-values so that we can compare the results from each.
round(rbind(KS.Moments=c(KS.Moments$statistic,P.Value=KS.Moments$p.value),
KS.Candidate.4=c(KS.Candidate.4$statistic,P.Value=KS.Candidate.4$p.value),
KS.Candidate.5=c(KS.Candidate.5$statistic,P.Value=KS.Candidate.5$p.value),
KS.Exp=c(KS.Exp$statistic,P.Value=KS.Exp$p.value),
KS.Normal=c(KS.Normal$statistic,KS.Normal$p.value)),7)
## D P.Value
## KS.Moments 0.0578105 0.3736123
## KS.Candidate.4 0.1312724 0.0003623
## KS.Candidate.5 0.0841841 0.0578236
## KS.Exp 0.1152330 0.0026158
## KS.Normal 0.1326020 0.0003040
What distribution for the one-minute intensity of malfunctions do you choose?
The gamma distribution is the best representation of the one-minute intensity of malfunctions. Looking at the dataset as a whole, we notice substantial overdispersion: the variability of the counts is more than the Poisson distribution can capture. As a result, the exponential distribution underestimates the likelihood of high-intensity minutes. This underestimation stems from the exponential’s relationship with the Poisson distribution: the exponential cannot handle a changing event rate, as it assumes the rate \(\lambda\) is constant.
The gamma distribution’s rate and shape have the highest likelihood of producing the one-minute intensities in the dataset. Therefore, gamma is the best distribution we were able to fit: it has the least significant KS test statistic, and it is the only candidate we are confident in as a good approximation. It handles the overdispersion well, and it is the model we select for one-minute intensities.
What distribution of one-minute malfunctions counts follow from your choice?
The negative binomial distribution is the best representation of the one-minute malfunction counts. The Poisson distribution assumes that the mean and variance of the observed data are equal; we have tested this assumption repeatedly and shown that it does not hold for our dataset. We need a count distribution that can handle overdispersion, and that distribution is the negative binomial. The Poisson distribution is a special, restricted case of the negative binomial in which no overdispersion exists. Therefore, our counting process, which exhibits overdispersion, is best approximated by the negative binomial.
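The gamma-to-negative-binomial link can be sketched by simulation: if each minute’s rate is drawn from a gamma and the count is Poisson at that rate, the resulting counts are negative binomial, with the gamma shape playing the role of \(\theta\). The parameters below are illustrative, not our fitted values:

```r
set.seed(11)
n     <- 2e5
shape <- 1.66   # gamma shape, acts as the NB theta
rate  <- 0.14   # gamma rate

lambda <- rgamma(n, shape = shape, rate = rate)  # a random rate per minute
counts <- rpois(n, lambda)                       # a Poisson count at that rate

mu <- shape / rate                    # NB mean
c(mean(counts), mu)                   # simulated vs. theoretical mean
c(var(counts),  mu + mu^2 / shape)    # variance exceeds the mean: overdispersion
```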
Write One.Minute.Counts.Temps to file OneMinuteCountsTemps.csv to continue working on Part 2.
write.csv(One.Minute.Counts.Temps,file="OneMinuteCountsTemps.csv",row.names=FALSE)
In Part 1, we investigated which probability distribution explains both the time between malfunctions and the per-minute malfunction count. The results help the company understand how frequently malfunctions occur and predict when they may arrive in the future. That said, the distributions do not explain why malfunctions occur in the first place, or what may cause them.
In this part, we explore the relationship between malfunctions and the temperature at which they occur.
First, let’s prepare the data. In the code below, we load the saved one-minute counts, drop the minutes with zero counts, convert the counts into per-second intensities, and plot intensity against temperature.
Explore possible types of dependence between one-minute counts and temperature.
# ---------------------- 2.2.1 ---------------------- #
Part2.Data<-read.csv("OneMinuteCountsTemps.csv", header = T)
head(Part2.Data)
## Minute.times Minute.counts Minute.Temps
## 1 30 7 91.59307
## 2 90 10 97.30860
## 3 150 7 95.98865
## 4 210 4 100.38440
## 5 270 1 99.98330
## 6 330 6 102.54126
dim(Part2.Data)
## [1] 250 3
# ---------------------- 2.2.2 ---------------------- #
# drop minutes with zero malfunction counts
# (use a new name so we don't shadow base R's complete.cases())
nonzero.rows <- with(Part2.Data, which(Minute.counts > 0))
Part2.Data<-Part2.Data[nonzero.rows,]
dim(Part2.Data)
## [1] 242 3
Part2.Data<-as.data.frame(cbind(Part2.Data,Part2.Data[,2]/60))
colnames(Part2.Data)<-c("Times","Counts","Temperatures","Intensities")
# ---------------------- 2.2.3 ---------------------- #
plot(Part2.Data$Temperatures,Part2.Data$Intensities)
The scatterplot above shows the relationship between the one-minute malfunction intensity rates and their respective temperature over the course of the minute. There is certainly a positive relationship between intensity and temperature, but we must be careful how we define that relationship. Intensity’s lower theoretical bound is zero, and in this case, it’s greater than zero because we have removed all minutes where intensity = 0. This constraint yields a nonlinear relationship between the raw values of intensities and temperature. In general, the malfunction rate hovers between 0 and 0.2 when temperatures are anywhere below 100. But when temperature breaks the 100-degree threshold, the intensity starts increasing with temperature in a roughly linear manner.
To further demonstrate this positive but nonlinear relationship, let’s fit a simple OLS model:
par(mfrow=c(2,3))
lm.intemp <- lm(Intensities~Temperatures, data = Part2.Data)
plot(lm.intemp)
plot(density(resid(lm.intemp)))
All of these plots reveal a non-constant rate of change between temperature and intensity. The nonlinearity is not symmetric; it arises from very high temperatures only. The fat right tail in the density plot and the one-sided skew in the Q-Q plot visualize this asymmetry best.
Analyze empirical copula.
We have established a nonlinear trend between intensities and temperature, but they also have a clear positive relationship. Let’s plot the ranks of both Temperature and Intensities:
# ---------------------- 2.2.5 ---------------------- #
plot(rank(Part2.Data$Temperatures),rank(Part2.Data$Intensities))
What type of dependency do you see in the empirical copula?
Note that a rank of 1 corresponds to the lowest temperature or intensity, while the highest temperatures and intensities carry the highest ranks, up in the 200s. Temperature and intensity now also share the same range and bounds, since each observation takes a rank between 1 and the number of remaining minutes (242).
These new constraints change the relationship between temperature and intensity. Transforming each variable to a bounded, uniform rank scale removes the effect of each original variable’s shape and distribution. As a result, the positive trend between the two variables is considerably less pronounced in the plot, but so is the nonlinearity.
The fat tail in the first plot stems from the positive relationship between temperature and intensity once temperature crosses the 100-degree threshold. In the rank plot, this relationship shows up in the upper-right corner: the ranks cluster together and track each other because the underlying data is positively correlated in this region. In an empirical copula plot, this top-right clustering is known as upper tail dependence.
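Ranks also give us a simple number for this monotone association: Pearson correlation computed on the ranks is, by definition, Spearman’s \(\rho\). A sketch on synthetic data built to mimic a weak link at low values and a tight link at high values (the construction and seed are arbitrary):

```r
set.seed(3)
x <- rnorm(300, mean = 100, sd = 5)          # "temperature"-like variable
y <- ifelse(x > 100, (x - 100) * 0.05, 0) +  # link kicks in above 100
     rexp(300, rate = 10)                    # noise floor elsewhere

# Identical by definition (continuous data, so no ties):
cor(rank(x), rank(y))
cor(x, y, method = "spearman")
```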
What is the distribution of temperatures?
In the code below, we look at the distribution of temperature. We then fit the distribution and test its fit against its theoretical. To do so, we use the familiar “fitdistr” and “ks.test” methods.
hist(Part2.Data$Temperatures)
# ---------------------- 2.2.6 ---------------------- #
(Fitting.Temp.Normal <- fitdistr(Part2.Data$Temperatures, "normal"))
## mean sd
## 100.0698530 4.8124839
## ( 0.3093582) ( 0.2187493)
plotdist(Part2.Data$Temperatures,"norm",
para=list(mean=Fitting.Temp.Normal$estimate[1],
sd=Fitting.Temp.Normal$estimate[2]))
# ---------------------- 2.2.7 ---------------------- #
(KS.Temp <- ks.test(Part2.Data$Temperatures,"pnorm",
mean = Fitting.Temp.Normal$estimate[1],
sd = Fitting.Temp.Normal$estimate[2]))
##
## One-sample Kolmogorov-Smirnov test
##
## data: Part2.Data$Temperatures
## D = 0.048912, p-value = 0.6089
## alternative hypothesis: two-sided
The histogram visualizes the distribution of the temperatures. Overall, the distribution appears to be normal. There is some positive skew, but the sample size is also fairly small (242 one-minute periods). The fitdistr and plotdist functions make us more comfortable with temperature’s normality: the plots show that the normal distribution fits very well using temperature’s estimated mean and standard deviation. Lastly, the KS test fails to reject the null hypothesis; there is no strong evidence that the distribution is not normal.
We now explore the dependency between temperature and intensity. Because the relationship is nonlinear, we cannot rely on traditional correlation metrics. Instead, we utilize copulas to model the dependence between the two random variables.
# ---------------------- 2.2.8 ---------------------- #
# gumbelCopula, fitCopula, and pobs come from the copula package
library(copula)
copula.cols <- Part2.Data[,3:4]
gumbel.cop<-gumbelCopula(param=5,dim=2)
Copula.Fit<-fitCopula(gumbel.cop,
pobs(copula.cols,ties.method = "average"),
method = "ml",
optim.method = "BFGS",
optim.control = list(maxit=1000))
summary(Copula.Fit)
## Call: fitCopula(copula, data = data, method = "ml", optim.method = "BFGS",
## optim.control = ..3)
## Fit based on "maximum likelihood" and 242 2-dimensional observations.
## Gumbel copula, dim. d = 2
## Estimate Std. Error
## alpha 1.877 0.099
## The maximized loglikelihood is 74.18
## Optimization converged
## Number of loglikelihood evaluations:
## function gradient
## 11 4
We fit the data to a Gumbel copula, which handles upper-tail dependence well. In our case, upper-tail dependence results from the positive relationship between temperature and intensity once temperature passes the 100-degree threshold.
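For the Gumbel family, the strength of that upper-tail dependence has a closed form: the tail-dependence coefficient is \(\lambda_U = 2 - 2^{1/\alpha}\). Plugging in the fitted \(\alpha = 1.877\):

```r
# Upper-tail dependence coefficient of a Gumbel copula
gumbel.lambda.u <- function(alpha) 2 - 2^(1 / alpha)

gumbel.lambda.u(1)       # 0: alpha = 1 is the independence copula
gumbel.lambda.u(1.877)   # about 0.55 for our fitted parameter
```

Roughly speaking, in the limit about half of the most extreme temperature minutes coincide with the most extreme intensity minutes, which matches the clustering we saw in the rank plots.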
Next, we simulate a copula using the estimates from the Gumbel copula fit.
par(mfrow=c(2,2))
simulate.copula <- function(est.cop, est.num, cop.type = gumbelCopula, sed = 8301735){
set.seed(sed)
sc <- cop.type(param=est.cop@estimate,dim=2)
persp(sc, dCopula, main="pdf",xlab="u", ylab="v", zlab="c(u,v)")
contour(sc, dCopula, main="pdf",xlab="u", ylab="v")
scc <- rCopula(est.num, sc)
sim.n <- length(scc[,1])
plot(scc,main="Simulated Copula",xlab="Variable 1",ylab="Variable 2")
plot(apply(scc,2,rank)/sim.n,main="Empirical Copula",xlab="Variable 1",ylab="Variable 2")
title("Copula.Fit",outer=TRUE,line=-2)
return(scc)
}
# ---------------------- 2.2.9 ---------------------- #
sim.250 <- simulate.copula(Copula.Fit, 250)
We simulate a copula with 250 random variables using our initial Copula Fit estimates. The simulated and empirical copula appear above. In the top right corner, we observe the pdf of the simulation. The larger circles in the upper right of that plot correspond to the upper-tail dependence we have noticed throughout the last few steps.
Next, we simulate 5000 random pairs using the dependency structure from the copula above. Then we use quantile functions to convert the copula samples into simulated temperatures and intensities, specifying the normal distribution for temperature and the gamma distribution for intensities. In Part 1, we selected the gamma distribution as the best model for intensity, and in this part we found that the normal distribution best approximates temperature.
# ---------------------- 2.2.10 ---------------------- #
set.seed(8301735)
sc <- gumbelCopula(param=Copula.Fit@estimate,dim=2)
sim.copula <- rCopula(5000, sc)
# gamma distributed with shape and rate using estimates from part 1 intensities
Simulated.Intensities <- qgamma(sim.copula[,1],1.655739,8.132313)
# normally distributed with mean and variance using estimates from temps
Simulated.Temperature <- qnorm(sim.copula[,2],Fitting.Temp.Normal$estimate[1],Fitting.Temp.Normal$estimate[2])
# ---------------------- 2.2.11 ---------------------- #
plot(Simulated.Temperature,Simulated.Intensities)
When we use the copula to simulate temperatures and intensities, we preserve the dependence between the two that we initially observed. Simulating thousands of additional pairs makes the relationship more apparent at scale. The plot above captures this relationship: again, we see the pick-up in intensities once temperature crosses the ~100-degree threshold.
Next, we plot the empirical copula:
# ---------------------- 2.2.12 ---------------------- #
plot(rank(Simulated.Temperature),rank(Simulated.Intensities))
The empirical copula plot of the simulated temperature-intensity pairs reveals the tail dependence we noted from the observed data.
The Gumbel copula helps us understand the dependency between temperature and intensity. There is a clear relationship between the two, although it is nonlinear: intensity’s rate of change is not constant as temperature increases, and the positive, roughly linear relationship appears only after temperature crosses a specific threshold.
We now consider which regression model can explain how intensity changes in response to temperature. Because we know the dataset contains overdispersion, we start with the negative binomial model:
# ---------------------- 2.2.13 ---------------------- #
NB.Fit.To.Sample <- glm.nb(Intensities*60~Temperatures, data = Part2.Data)
NB.Fit.To.Sample$coefficients
## (Intercept) Temperatures
## -7.43137538 0.09843241
NB.Fit.To.Sample$deviance
## [1] 257.1937
NB.Fit.To.Sample$df.residual
## [1] 240
NB.Fit.To.Sample$aic
## [1] 1557.817
In the code above, we compute the Negative Binomial fit for our observed dataset. We show the coefficient for Temperature, the model’s deviance vs. degrees of freedom, and the model’s AIC. We will comment on the goodness-of-fit for this model shortly.
Next, we create the simulated sample for tail events. In this case, we define tail events as those with temperature over 110 and intensity rate over 0.5. These pairs represent the strongest upper-tail dependency in our copula plots. We apply these constraints to our dataset and plot the result:
Simulated.Tails<-as.data.frame(
cbind(round(Simulated.Intensities[(Simulated.Temperature>110)&(Simulated.Intensities>.5)]*60),
Simulated.Temperature[(Simulated.Temperature>110)&(Simulated.Intensities>.5)]))
colnames(Simulated.Tails)<-c("Counts","Temperatures")
# ---------------------- 2.2.14 ---------------------- #
plot(Simulated.Tails$Temperatures,Simulated.Tails$Counts)
We plot the simulated temperature and counts within the upper-right tail. The main purpose of simulation is to generate additional points that have the same dependency structure as our original data. Our original data did not have enough observations in the upper tail (and only had 250 observations total). Our simulated data, however, gives us a large enough sample to understand the relationship between temperature and malfunction count in the upper tail.
Fit negative binomial model to the tail observations Simulated.Tails.
Now that we’ve isolated the tail data, we regress Counts on Temperatures again and compare the fit on the simulated tails with the initial fit on the observed data:
# ---------------------- 2.2.15 ---------------------- #
Simulated.Tails.NB <- glm.nb(Counts~Temperatures, data = Simulated.Tails)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
# ---------------------- 2.2.16 ---------------------- #
summary(NB.Fit.To.Sample)
##
## Call:
## glm.nb(formula = Intensities * 60 ~ Temperatures, data = Part2.Data,
## init.theta = 4.202611757, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5358 -0.9162 -0.1509 0.4795 3.2190
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.431375 0.785456 -9.461 <2e-16 ***
## Temperatures 0.098432 0.007793 12.631 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(4.2026) family taken to be 1)
##
## Null deviance: 431.60 on 241 degrees of freedom
## Residual deviance: 257.19 on 240 degrees of freedom
## AIC: 1557.8
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 4.203
## Std. Err.: 0.549
##
## 2 x log-likelihood: -1551.817
summary(Simulated.Tails.NB)
##
## Call:
## glm.nb(formula = Counts ~ Temperatures, data = Simulated.Tails,
## init.theta = 385981.6374, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7586 -0.7146 0.0311 0.5559 3.1897
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.9678 1.2887 -6.183 6.3e-10 ***
## Temperatures 0.1050 0.0115 9.127 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(385981.6) family taken to be 1)
##
## Null deviance: 158.968 on 83 degrees of freedom
## Residual deviance: 80.034 on 82 degrees of freedom
## AIC: 556.74
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 385982
## Std. Err.: 8986395
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -550.742
To analyze the goodness-of-fit of each regression, let’s quickly review the purpose of the negative binomial relative to Poisson. In Poisson, \(\lambda = \mu = \sigma^2\) (the intensity of a Poisson process is both its mean and its variance), so Poisson data has variance equal to the mean. In reality, this relationship often fails: the variance in a dataset typically exceeds the mean, and overdispersion occurs. This overdispersion is an important part of the data’s variability, so a Poisson model underestimates variance when overdispersion exists. To model it, the negative binomial redefines the relationship between the mean and the variance of a process: \[\sigma^2 = \mu + \frac{\mu^2}{\theta}\] In essence, the negative binomial allows the variance to exceed the mean. Significant overdispersion exists when \(\theta\) is small; as \(\theta\) grows large, the negative binomial converges to the Poisson distribution.
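This mean-variance relationship can be checked directly with base R’s rnbinom, whose size argument is \(\theta\) in this parametrization; the mean and \(\theta\) below are illustrative values, not our fitted ones:

```r
set.seed(9)
mu    <- 6     # illustrative mean
theta <- 4.2   # illustrative dispersion parameter

counts <- rnbinom(2e5, mu = mu, size = theta)

c(mean(counts), mu)                   # sample mean vs. mu
c(var(counts),  mu + mu^2 / theta)    # sample variance vs. mu + mu^2/theta
```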
So far, we’ve fit two negative binomial regressions. The first uses every observation in our original dataset; the second uses simulated data in the upper tail, where the relationship between temperature and intensity becomes more defined. In the first model, the residual deviance (257.19) is somewhat larger than the residual degrees of freedom (240), while the coefficients of the two models are similar. (The AICs, 1557.8 and 556.7, are not directly comparable since the models are fit to different datasets.) The first model handles data with far more variability, so its fit is looser. That said, using the negative binomial is justified in the first case, given the overdispersion discussed below:
What do the fitted parameters θ tell you about both models?
In the first model, \(\theta = 4.203\). This value is quite small, so the negative binomial overdispersion term \(\frac{\mu^2}{\theta}\) is significant and the variance exceeds the mean. In the second (upper-tail) model, \(\theta = 385982\), which is enormous. This essentially wipes out the overdispersion term: \(\sigma^2 = \mu + \frac{\mu^2}{385982} \approx \mu\). The upper-tail model therefore shows no overdispersion, and the negative binomial may not be the best choice in this case.
Is there an alternative model that you would try to fit to the simulated tail data?
The simulated-tails model has a very large value for \(\theta\), which means the variance of the distribution essentially equals the mean. This relationship signals that a Poisson model is more appropriate. We can test this hypothesis to confirm that Poisson is the better selection. For completeness, let’s test both the full model and the simulated-tails model.
odTest(NB.Fit.To.Sample)
## Likelihood ratio test of H0: Poisson, as restricted NB model:
## n.b., the distribution of the test-statistic under H0 is non-standard
## e.g., see help(odTest) for details/references
##
## Critical value of test statistic at the alpha= 0.05 level: 2.7055
## Chi-Square Test Statistic = 304.951 p-value = < 2.2e-16
odTest(Simulated.Tails.NB)
## Likelihood ratio test of H0: Poisson, as restricted NB model:
## n.b., the distribution of the test-statistic under H0 is non-standard
## e.g., see help(odTest) for details/references
##
## Critical value of test statistic at the alpha= 0.05 level: 2.7055
## Chi-Square Test Statistic = -5e-04 p-value = 0.5
The odTest null hypothesis states that the Poisson model is adequate as a restricted negative binomial. We can think of Poisson as the restricted version of the negative binomial reached when \(\theta\) tends to infinity, i.e., when no overdispersion is present; under the null, \(\theta\) should be very large.
The first odTest, on the full dataset, rejects the null hypothesis: overdispersion exists, so the Poisson distribution is not appropriate there. The second odTest does not reject the null: the simulated tail data is well approximated by a Poisson distribution. Therefore, the negative binomial we fit to the tails may not be the best model, and these tests provide statistical evidence in favor of a Poisson model for the simulated tail instead.
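The odTest statistic itself is just twice the log-likelihood gap between the negative binomial fit and its Poisson restriction, which we can reproduce by hand on synthetic overdispersed counts (arbitrary seed and coefficients):

```r
library(MASS)

set.seed(13)
x <- runif(300, 90, 110)
# Counts with a temperature-like covariate and gamma-varying rates (theta = 3)
mu.x <- exp(-4 + 0.05 * x)
y <- rpois(300, rgamma(300, shape = 3, rate = 3 / mu.x))

pois.fit <- glm(y ~ x, family = poisson)
nb.fit   <- glm.nb(y ~ x)

# Large values reject the Poisson restriction (theta -> infinity)
(lr.stat <- 2 * (as.numeric(logLik(nb.fit)) - as.numeric(logLik(pois.fit))))
```

On genuinely overdispersed data like this, the statistic far exceeds the 2.7055 critical value quoted by odTest; on the simulated tails it was essentially zero.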
What do both models tell you about the relationships between the temperature and the counts?
The models reveal that the relationship between temperature and counts is complex and cannot be captured by a single distribution. Over the entire process, a negative binomial distribution is the most appropriate, because overdispersion exists. But that overdispersion is driven by random variation in malfunction counts that is unrelated to temperature and may not be interesting to us or to the company. Below 100 degrees, malfunction intensity varies independently of temperature, with little relationship between the two. Above 100 degrees, malfunction intensity has a tight, positive relationship with temperature: as the temperature rises, the malfunction count rises as well, and the counts are best approximated by a Poisson distribution.
Therefore, our counting process is a mixture of negative binomial and Poisson, while our intensity is a mixture of gamma and exponential. The Poisson counts and exponential intensity take over once temperature crosses a certain threshold.
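One way to make this mixture explicit is to fit the two regimes separately, splitting at the suspected threshold. A hedged sketch on simulated data (the 100-degree cutoff mirrors our discussion; the data and names below are invented for illustration):

```r
set.seed(7)
# Simulate the two regimes: flat, overdispersed counts below 100 degrees,
# Poisson counts rising exponentially with temperature above it
temps  <- seq(90, 120, by = 0.5)
counts <- ifelse(temps < 100,
                 rnbinom(length(temps), size = 2, mu = 3),     # flat, overdispersed
                 rpois(length(temps), exp(-8 + 0.1 * temps)))  # rising, Poisson
d <- data.frame(Temperatures = temps, Counts = counts)

# Fit a log-linear count model within each regime
below <- glm(Counts ~ Temperatures, family = "poisson",
             data = subset(d, Temperatures < 100))
above <- glm(Counts ~ Temperatures, family = "poisson",
             data = subset(d, Temperatures >= 100))

# Below the threshold the slope is typically indistinguishable from zero;
# above it the slope is clearly positive and significant
coef(summary(below))["Temperatures", ]
coef(summary(above))["Temperatures", ]
```

Splitting at the threshold keeps each sub-model simple and matches the regime-dependent behavior described above.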
Fit a Poisson model to Simulated.Tails$Counts and compare the fit with the NB model for Part2.Data.
Finally, let’s fit a Poisson regression to the simulated tails:
# ---------------------- 2.2.20 ---------------------- #
Poisson.Fit <- glm(Counts~Temperatures, data = Simulated.Tails, family = "poisson")
summary(Poisson.Fit)
##
## Call:
## glm(formula = Counts ~ Temperatures, family = "poisson", data = Simulated.Tails)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7587 -0.7147 0.0311 0.5559 3.1899
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.9678 1.2886 -6.183 6.28e-10 ***
## Temperatures 0.1050 0.0115 9.128 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 158.987 on 83 degrees of freedom
## Residual deviance: 80.043 on 82 degrees of freedom
## AIC: 554.74
##
## Number of Fisher Scoring iterations: 4
Poisson.Fit$deviance
## [1] 80.04321
Poisson.Fit$df.residual
## [1] 82
Poisson.Fit$aic
## [1] 554.7414
We see that the model fit is quite similar to what the negative binomial returns: the coefficients, residual deviance, and degrees of freedom are all roughly the same. This similarity occurs because the Poisson distribution is a limiting case of the negative binomial; when the dispersion is close to 1, the mean and variance are essentially equal.
The main difference between the Poisson and negative binomial fits lies in the likelihood each one maximizes. The negative binomial must also estimate \(\theta\), which in this case is extremely large. Still, the estimated \(\theta\) is not infinite, so although the variance approaches the mean in the limit, it never exactly equals it. The Poisson fit therefore produces a lower AIC; the difference is quite small, but it is still apparent. Because the AIC penalizes the extra parameter and uses the log-likelihood in its computation, a true Poisson process (such as the one we have in our tail) is fit best by the Poisson distribution.
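The AIC comparison can be checked directly: AIC = 2k − 2·logLik, and the negative binomial pays for one extra parameter (\(\theta\)) without improving the likelihood on Poisson data. A minimal sketch on simulated data (illustrative, not the project’s data):

```r
library(MASS)   # for glm.nb

set.seed(123)
sim <- data.frame(Temperatures = runif(150, 100, 120))
sim$Counts <- rpois(150, exp(-8 + 0.1 * sim$Temperatures))

pois.fit <- glm(Counts ~ Temperatures, data = sim, family = "poisson")
nb.fit   <- suppressWarnings(glm.nb(Counts ~ Temperatures, data = sim))

# The Poisson fit has 2 parameters; the NB adds theta as a third.
# On truly Poisson data the log-likelihoods nearly coincide, so the
# NB's AIC is higher by roughly 2 (the penalty for the extra parameter)
c(Poisson = AIC(pois.fit), NegBin = AIC(nb.fit))
```

This reproduces, in miniature, why the Poisson regression edges out the negative binomial on the simulated tails.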
We conclude that the relationship between temperature and malfunction count arises from a mixture of distributions, depending on the temperature. The negative binomial fits the process best when we consider the entire dataset. That said, the portion we care the most about - when malfunction counts per minute are rising - is fit best by the Poisson distribution, and the relationship between temperature and malfunctions is best described by a Poisson regression model.
The company now knows the relationship between temperature and malfunction count: keeping the temperature under the 100-degree threshold is important. Malfunctions occur regardless of temperature, but they become problematic once the temperature rises past that limit. So long as the company accepts the average rate of malfunctions below 100 degrees, it should be able to remain efficient.